Reinforcement Learning via AIXI Approximation 



Joel Veness (University of NSW & NICTA), JOELV@CSE.UNSW.EDU.AU
Kee Siong Ng (Medicare Australia & ANU), KEESIONG.NG@GMAIL.COM
Marcus Hutter (ANU & NICTA), MARCUS.HUTTER@ANU.EDU.AU
David Silver (University College London), DAVIDSTARSILVER@GOOGLEMAIL.COM



13 July 2010 






Abstract 

This paper introduces a principled approach for the design 
of a scalable general reinforcement learning agent. This 
approach is based on a direct approximation of AIXI, a 
Bayesian optimality notion for general reinforcement learn- 
ing agents. Previously, it has been unclear whether the the- 
ory of AIXI could motivate the design of practical algo- 
rithms. We answer this hitherto open question in the affir- 
mative, by providing the first computationally feasible ap- 
proximation to the AIXI agent. To develop our approxi- 
mation, we introduce a Monte Carlo Tree Search algorithm 
along with an agent-specific extension of the Context Tree 
Weighting algorithm. Empirically, we present a set of en- 
couraging results on a number of stochastic, unknown, and 
partially observable domains. 

1 Introduction

Consider an agent that exists within some unknown environment. The agent interacts with the environment in cycles. At each cycle, the agent executes an action and receives in turn an observation and a reward. The general reinforcement learning problem is to construct an agent that, over time, collects as much reward as possible from an initially unknown environment.



The AIXI agent [Hut05] is a formal, mathematical solution to the general reinforcement learning problem. It can be decomposed into two main components: planning and prediction. Planning amounts to performing an expectimax operation to determine each action. Prediction uses Bayesian model averaging, over the largest possible model class expressible on a Turing Machine, to predict future observations and rewards based on past experience. AIXI is shown in [Hut05] to be optimal in the sense that it will rapidly learn an accurate model of the unknown environment and exploit it to maximise its expected future reward.

As AIXI is only asymptotically computable, it is by no 
means an algorithmic solution to the general reinforcement 
learning problem. Rather it is best understood as a Bayesian 
optimality notion for decision making in general unknown 
environments. This paper demonstrates, for the first time, 
how a practical agent can be built from the AIXI theory. 
Our solution directly approximates the planning and prediction components of AIXI. In particular, we use a generalisation of UCT [KS06] to approximate the expectimax operation, and an agent-specific extension of CTW [WST95],
a Bayesian model averaging algorithm for prediction suffix 
trees, for prediction and learning. Perhaps surprisingly, this 
kind of direct approximation is possible, practical and theo- 
retically appealing. Importantly, the essential characteristic 
of AIXI, its generality, can be largely preserved. 

2 The Agent Setting 

This section introduces the notation and terminology we 
will use to describe strings of agent experience, the true 
underlying environment and the agent's model of the en- 
vironment. 

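The (finite) action, observation, and reward spaces are denoted by $\mathcal{A}$, $\mathcal{O}$, and $\mathcal{R}$ respectively. An observation-reward pair $or$ is called a percept. We use $\mathcal{X}$ to denote the percept space $\mathcal{O} \times \mathcal{R}$.

Definition 1 A history $h$ is an element of $(\mathcal{A} \times \mathcal{X})^* \cup (\mathcal{A} \times \mathcal{X})^* \times \mathcal{A}$.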

Notation: A string $x_1 x_2 \ldots x_n$ of length $n$ is denoted by $x_{1:n}$. The empty string is denoted by $\epsilon$. The concatenation of two strings $s$ and $r$ is denoted by $sr$. The prefix $x_{1:j}$ of $x_{1:n}$, $j \le n$, is denoted by $x_{\le j}$ or $x_{<j+1}$. The notation generalises for blocks of symbols: e.g. $ax_{1:n}$ denotes $a_1 x_1 a_2 x_2 \ldots a_n x_n$ and $ax_{<j}$ denotes $a_1 x_1 a_2 x_2 \ldots a_{j-1} x_{j-1}$.

The following definition states that the environment takes the form of a probability distribution over possible percept sequences conditioned on actions taken by the agent.

Definition 2 An environment $\rho$ is a sequence of conditional probability functions $\{\rho_0, \rho_1, \rho_2, \ldots\}$, where $\rho_n : \mathcal{A}^n \to \mathrm{Density}(\mathcal{X}^n)$, that satisfies

$$\forall a_{1:n} \, \forall x_{<n} : \quad \rho_{n-1}(x_{<n} \mid a_{<n}) = \sum_{x_n \in \mathcal{X}} \rho_n(x_{1:n} \mid a_{1:n}). \tag{1}$$

In the base case, we have $\rho_0(\epsilon \mid \epsilon) = 1$.

Equation (1), called the chronological condition in [Hut05], captures the natural constraint that action $a_n$ has no effect on observations made before it. For convenience, we drop the index $n$ in $\rho_n$ from here onwards.

Given an environment $\rho$,

$$\rho(x_n \mid ax_{<n} a_n) := \frac{\rho(x_{1:n} \mid a_{1:n})}{\rho(x_{<n} \mid a_{<n})} \tag{2}$$

is the $\rho$-probability of observing $x_n$ in cycle $n$ given history $h = ax_{<n} a_n$, provided $\rho(x_{<n} \mid a_{<n}) > 0$. It now follows that

$$\rho(x_{1:n} \mid a_{1:n}) = \rho(x_1 \mid a_1) \, \rho(x_2 \mid ax_1 a_2) \cdots \rho(x_n \mid ax_{<n} a_n). \tag{3}$$

Definition 2 is used to describe both the true (but unknown) underlying environment and the agent's subjective model of the environment. The latter is called the agent's environment model and is typically learnt from data. Definition 2 is extremely general. It captures a wide variety of environments, including standard reinforcement learning setups such as MDPs and POMDPs.

The agent's goal is to accumulate as much reward as it can during its lifetime. More precisely, the agent seeks a policy that will allow it to maximise its expected future reward up to a fixed, finite, but arbitrarily large horizon $m \in \mathbb{N}$. Formally, a policy is a function that maps a history to an action. The expected future value of an agent acting under a particular policy is defined as follows.

Definition 3 Given history $ax_{1:t}$, the $m$-horizon expected future reward of an agent acting under policy $\pi : (\mathcal{A} \times \mathcal{X})^* \to \mathcal{A}$ with respect to an environment $\rho$ is

$$v_\rho^m(\pi, ax_{1:t}) := \mathbb{E}_{x_{t+1:t+m} \sim \rho} \left[ \sum_{i=t+1}^{t+m} R_i(ax_{\le t+m}) \right], \tag{4}$$

where for $t+1 \le k \le t+m$, $a_k := \pi(ax_{<k})$, and $R_k(aor_{\le t+m}) := r_k$. The quantity $v_\rho^m(\pi, ax_{1:t} a_{t+1})$ is defined similarly, except that $a_{t+1}$ is now no longer defined by $\pi$.

The optimal policy $\pi^*$ is the policy that maximises the expected future reward. The maximal achievable expected future reward of an agent with history $h$ in environment $\rho$ looking $m$ steps ahead is $V_\rho^m(h) := v_\rho^m(\pi^*, h)$. It is easy to see that if $h = ax_{1:t} \in (\mathcal{A} \times \mathcal{X})^t$, then

$$V_\rho^m(h) = \max_{a_{t+1}} \sum_{x_{t+1}} \rho(x_{t+1} \mid h a_{t+1}) \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \rho(x_{t+m} \mid h \, ax_{t+1:t+m-1} a_{t+m}) \left[ \sum_{i=t+1}^{t+m} r_i \right]. \tag{5}$$



We will refer to Equation (5) as the expectimax operation. The $m$-horizon optimal action $a^*_{t+1}$ at time $t+1$ is related to the expectimax operation by

$$a^*_{t+1} = \arg\max_{a_{t+1}} V_\rho^m(ax_{1:t} a_{t+1}). \tag{6}$$



Eqs. (4) and (5) can be modified to handle discounted reward; however, we focus on the finite-horizon case since it both
aligns with AIXI and allows for a simplified presentation. 
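
As a rough illustration of the expectimax operation in Equation (5), the following Python sketch evaluates the $m$-horizon value by full-width recursion over every action and percept. It relies on the fact that, because the percept probabilities sum to one, the nested max/sum form can be computed as "immediate reward plus value of the remaining horizon". The names `expectimax` and `toy_rho`, the tuple encoding of histories, and the toy environment itself are assumptions made for this example; they are not taken from the paper.

```python
def expectimax(rho, actions, percepts, h, m):
    """Full-width m-horizon value of history h (a tuple ending in a percept).

    rho(h_with_action, x) must return the probability of percept x = (o, r)
    given a history that ends with an action. All names here are illustrative.
    """
    if m == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        value = 0.0
        for x in percepts:
            p = rho(h + (a,), x)
            if p > 0.0:
                reward = x[1]
                value += p * (reward + expectimax(rho, actions, percepts, h + (a, x), m - 1))
        best = max(best, value)
    return best


def toy_rho(h_with_action, x):
    """Hypothetical environment: action 0 pays 0.5 for sure, action 1 pays 1 w.p. 0.8."""
    a = h_with_action[-1]
    _, r = x
    if a == 0:
        return 1.0 if r == 0.5 else 0.0
    return {1.0: 0.8, 0.0: 0.2}.get(r, 0.0)


TOY_ACTIONS = [0, 1]
TOY_PERCEPTS = [(0, 0.0), (0, 0.5), (0, 1.0)]
```

The cost of this recursion grows as $|\mathcal{A} \times \mathcal{X}|^m$, which is why it is only usable for very small horizons and percept spaces.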



3 Bayesian Agents 

In the general reinforcement learning setting, the environment $\rho$ is unknown to the agent. One way to learn an environment model is to take a Bayesian approach. Instead of
committing to any single environment model, the agent uses 
a mixture of environment models. This requires committing 
to a class of possible environments (the model class), as- 
signing an initial weight to each possible environment (the 
prior), and subsequently updating the weight for each model 
using Bayes rule (computing the posterior) whenever more 
experience is obtained. 

The above procedure is similar to Bayesian methods for 
predicting sequences of (singly typed) observations. The 
key difference in the agent setup is that each prediction is 
now also dependent on previous agent actions. We incor- 
porate this by using the action-conditional definitions and 
identities of Section 2.

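Definition 4 Given a model class $\mathcal{M} := \{\rho_1, \rho_2, \ldots\}$ and a prior weight $w_0^\rho > 0$ for each $\rho \in \mathcal{M}$ such that $\sum_{\rho \in \mathcal{M}} w_0^\rho = 1$, the mixture environment model is

$$\xi(x_{1:n} \mid a_{1:n}) := \sum_{\rho \in \mathcal{M}} w_0^\rho \, \rho(x_{1:n} \mid a_{1:n}).$$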

The next result follows immediately. 

Proposition 1 A mixture environment model is an environ- 
ment model. 

Proposition 1 allows us to use a mixture environment
model whenever we can use an environment model. Its im- 
portance will become clear shortly. 

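To make predictions using a mixture environment model $\xi$, we use

$$\xi(x_n \mid ax_{<n} a_n) = \frac{\xi(x_{1:n} \mid a_{1:n})}{\xi(x_{<n} \mid a_{<n})}, \tag{7}$$

which follows from Proposition 1 and Eq. (2). The RHS of Eq. (7) can be written out as a convex combination of model predictions to give

$$\xi(x_n \mid ax_{<n} a_n) = \sum_{\rho \in \mathcal{M}} w_{n-1}^\rho \, \rho(x_n \mid ax_{<n} a_n), \tag{8}$$

where the posterior weight $w_{n-1}^\rho$ for $\rho$ is given by

$$w_{n-1}^\rho := \frac{w_0^\rho \, \rho(x_{<n} \mid a_{<n})}{\sum_{\nu \in \mathcal{M}} w_0^\nu \, \nu(x_{<n} \mid a_{<n})} = \Pr(\rho \mid ax_{<n}). \tag{9}$$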



Bayesian agents enjoy a number of strong theoretical performance guarantees; these are explored in Section 6. In practice, the main difficulty in using a mixture environment model is computational. A rich model class is required if the mixture environment model is to possess general prediction capabilities; however, naively using (8) for online prediction requires at least $O(|\mathcal{M}|)$ time to process each new piece of experience. One of our main contributions, introduced in Section 5, is a large, efficiently computable mixture environment model that runs in time $O(\log(\log |\mathcal{M}|))$.
Before looking at that, we will examine in the next section 
a Monte Carlo Tree Search algorithm for approximating the 
expectimax operation. 
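
To make the naive mixture computation concrete, here is a minimal Python sketch of prediction with a finite model class and of the corresponding Bayes update of the weights. It uses the incremental form of the posterior, in which each weight is multiplied by that model's probability of the new percept and the result is renormalised. The function names and the two constant Bernoulli models are hypothetical choices for illustration, and each call touches every model, which is exactly the $O(|\mathcal{M}|)$ cost discussed above.

```python
def mixture_predict(weights, models, history, x):
    """Mixture probability of percept x: a weighted sum of the model predictions."""
    return sum(w * rho(history, x) for w, rho in zip(weights, models))


def posterior_update(weights, models, history, x):
    """Bayes rule: scale each weight by its model's probability of x, then normalise."""
    unnormalised = [w * rho(history, x) for w, rho in zip(weights, models)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("percept has zero probability under every model")
    return [u / total for u in unnormalised]


# Two hypothetical models over a binary percept that ignore the history.
def mostly_one(history, x):
    return 0.9 if x == 1 else 0.1


def mostly_zero(history, x):
    return 0.1 if x == 1 else 0.9


MODELS = [mostly_one, mostly_zero]
PRIOR = [0.5, 0.5]
```

After a single percept of 1, the weight on `mostly_one` rises from 0.5 to 0.9, and the mixture's next prediction moves accordingly.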

4 Monte Carlo Expectimax Approximation

Full-width computation of the expectimax operation (5) takes $O(|\mathcal{A} \times \mathcal{X}|^m)$ time, which is unacceptable for all but tiny values of $m$. This section introduces ρUCT, a generalisation of the popular UCT algorithm [KS06] that can be used to approximate a finite horizon expectimax operation given an environment model $\rho$. The key idea of Monte Carlo search is to sample observations from the environment, rather than exhaustively considering all possible observations. This allows for effective planning in environments with large observation spaces. Note that since an environment model subsumes both MDPs and POMDPs, ρUCT effectively extends the UCT algorithm to a wider class of problem domains.

The UCT algorithm has proven effective in solving large 
discounted or finite horizon MDPs. It assumes a generative 
model of the MDP that when given a state-action pair (s, a) 
produces a subsequent state-reward pair (s',r) distributed 
according to $\Pr(s', r \mid s, a)$. By successively sampling trajectories through the state space, the UCT algorithm incrementally constructs a search tree, with each node containing
an estimate of the value of each state. Given enough time, 
these estimates converge to the true values. 

The ρUCT algorithm can be realised by replacing the notion of state in UCT by an agent history $h$ (which is always a sufficient statistic) and using an environment model $\rho(or \mid h)$



to predict the next percept. The main subtlety with this ex- 
tension is that the history used to determine the conditional 
probabilities must be updated during the search to reflect 
the extra information an agent will have at a hypothetical 
future point in time. 

We will use $\Psi$ to represent all the nodes in the search tree, $\Psi(h)$ to represent the node corresponding to a particular history $h$, $\hat{V}_\rho^m(h)$ to represent the sample-based estimate of the expected future reward, and $T(h)$ to denote the number of times a node $\Psi(h)$ has been sampled. Nodes corresponding to histories that end or do not end with an action are called chance and decision nodes respectively.

Algorithm 1 describes the top-level algorithm, which the agent calls at the beginning of each cycle. It is initialised with the agent's total experience $h$ (up to time $t$) and the planning horizon $m$. It repeatedly invokes the Sample routine until out of time. Importantly, ρUCT is an anytime algorithm; an approximate best action, whose quality improves with time, is always available. This is retrieved by BestAction, which computes $a_t^* = \arg\max_{a_t} \hat{V}_\rho^m(ax_{<t} a_t)$.

Algorithm 1 ρUCT($h$, $m$)

Require: A history $h$
Require: A search horizon $m \in \mathbb{N}$

1: Initialise($\Psi$)
2: repeat
3:     Sample($\Psi$, $h$, $m$)
4: until out of time
5: return BestAction($\Psi$, $h$)

Algorithm 2 describes the recursive routine used to sample a single future trajectory. It uses the SelectAction routine to choose moves at interior nodes, and invokes the Rollout routine at unexplored leaf nodes. The Rollout routine picks actions uniformly at random until the (remaining) horizon is reached, returning the accumulated reward. After a complete trajectory of length $m$ is simulated, the value estimates are updated for each node traversed. Notice that the recursive calls on Lines 6 and 11 append the most recent percept or action to the history argument.

Algorithm 3 describes the UCB [Aue02] policy used to select actions at decision nodes. The $\alpha$ and $\beta$ constants denote the smallest and largest elements of $\mathcal{R}$ respectively. The parameter $C$ varies the selectivity of the search; larger values grow bushier trees. UCB automatically focuses attention on the best looking action in such a way that the sample estimate $\hat{V}_\rho^m(h)$ converges to $V_\rho^m(h)$, whilst still exploring alternate actions sufficiently often to guarantee that the best action will be found.
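
The Sample, SelectAction and Rollout routines of Algorithms 1 to 3 can be sketched in Python as follows. This is an illustrative reading under several assumptions rather than the authors' implementation: histories are tuples, the search tree is stored as two dictionaries keyed by history, the environment model is a hypothetical callable `model(h_with_action, rng)` that samples a percept `(o, r)`, and a fixed simulation count `n_sims` stands in for "until out of time". The parameters `r_min` and `r_max` play the role of the smallest and largest rewards.

```python
import math
import random


def rho_uct(model, actions, h, m, n_sims, C=1.4, r_min=0.0, r_max=1.0, rng=random):
    """Return an approximate best action for history h with search horizon m."""
    V, T = {}, {}  # value estimates and visit counts, keyed by history tuple

    def rollout(h, m):
        # Uniformly random actions until the remaining horizon is exhausted.
        total = 0.0
        for _ in range(m):
            a = rng.choice(actions)
            o, r = model(h + (a,), rng)
            h = h + (a, (o, r))
            total += r
        return total

    def select_action(h, m):
        untried = [a for a in actions if T.get(h + (a,), 0) == 0]
        if untried:
            return rng.choice(untried)

        def ucb(a):
            ha = h + (a,)
            exploit = V[ha] / (m * (r_max - r_min))
            explore = C * math.sqrt(math.log(T[h]) / T[ha])
            return exploit + explore

        return max(actions, key=ucb)

    def sample(h, m, is_chance_node):
        if m == 0:
            return 0.0
        if is_chance_node:
            o, r = model(h, rng)
            reward = r + sample(h + ((o, r),), m - 1, False)
        elif T.get(h, 0) == 0:
            reward = rollout(h, m)
        else:
            a = select_action(h, m)
            reward = sample(h + (a,), m, True)
        n = T.get(h, 0)
        V[h] = (reward + n * V.get(h, 0.0)) / (n + 1)  # running mean of returns
        T[h] = n + 1
        return reward

    for _ in range(n_sims):
        sample(h, m, False)
    return max(actions, key=lambda a: V.get(h + (a,), float("-inf")))


def toy_model(h_with_action, rng):
    """Hypothetical one-step environment: action 1 pays reward 1, action 0 pays 0."""
    return (0, 1.0 if h_with_action[-1] == 1 else 0.0)
```

Because every recursive call extends the history tuple with the sampled action or percept, the model is always queried with the information the agent would have at that hypothetical future point, which is the subtlety noted earlier in this section.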

The ramifications of the ρUCT extension are particularly significant to Bayesian agents described in Section 3. Proposition 1 allows ρUCT to be instantiated with a mixture environment model, which directly incorporates model uncertainty into the planning process. This gives (in principle, provided that the model class contains the true environment and ignoring issues of limited computation) the well known Bayes-optimal solution to the exploration/exploitation dilemma; namely, if a reduction in model uncertainty would lead to higher expected future reward, ρUCT would recommend an information gathering action.

Algorithm 2 Sample($\Psi$, $h$, $m$)

Require: A search tree $\Psi$
Require: A history $h$
Require: A remaining search horizon $m \in \mathbb{N}$

 1: if $m = 0$ then
 2:     return 0
 3: else if $\Psi(h)$ is a chance node then
 4:     Generate $(o, r)$ from $\rho(or \mid h)$
 5:     Create node $\Psi(hor)$ if $T(hor) = 0$
 6:     reward $\leftarrow r +$ Sample($\Psi$, $hor$, $m - 1$)
 7: else if $T(h) = 0$ then
 8:     reward $\leftarrow$ Rollout($h$, $m$)
 9: else
10:     $a \leftarrow$ SelectAction($\Psi$, $h$, $m$)
11:     reward $\leftarrow$ Sample($\Psi$, $ha$, $m$)
12: end if
13: $\hat{V}(h) \leftarrow \frac{1}{T(h)+1} [\text{reward} + T(h) \hat{V}(h)]$
14: $T(h) \leftarrow T(h) + 1$
15: return reward

Algorithm 3 SelectAction($\Psi$, $h$, $m$)

Require: A search tree $\Psi$
Require: A history $h$
Require: A remaining search horizon $m \in \mathbb{N}$
Require: An exploration/exploitation constant $C > 0$

1: $\mathcal{U} = \{a \in \mathcal{A} : T(ha) = 0\}$
2: if $\mathcal{U} \ne \{\}$ then
3:     Pick $a \in \mathcal{U}$ uniformly at random
4:     Create node $\Psi(ha)$
5:     return $a$
6: else
7:     return $\arg\max_{a \in \mathcal{A}} \left\{ \frac{1}{m(\beta - \alpha)} \hat{V}(ha) + C \sqrt{\frac{\log T(h)}{T(ha)}} \right\}$
8: end if

5 Action-Conditional CTW

We now introduce a large mixture environment model for use with ρUCT. Context Tree Weighting (CTW) [WST95] is an efficient and theoretically well-studied binary sequence prediction algorithm that works well in practice. It is an online Bayesian model averaging algorithm that computes, at each time point $t$, the probability

$$\Pr(y_{1:t}) = \sum_{M} \Pr(M) \Pr(y_{1:t} \mid M), \tag{10}$$

where $y_{1:t}$ is the binary sequence seen so far, $M$ is a prediction suffix tree [RST96], $\Pr(M)$ is the prior probability of $M$, and the summation is over all prediction suffix trees of bounded depth $D$. A naive computation of (10) takes time $O(2^{2^D})$; using CTW, this computation requires only $O(D)$ time. In this section, we outline how CTW can be extended to compute probabilities of the form

$$\Pr(x_{1:t} \mid a_{1:t}) = \sum_{M} \Pr(M) \Pr(x_{1:t} \mid M, a_{1:t}), \tag{11}$$

where $x_{1:t}$ is a percept sequence, $a_{1:t}$ is an action sequence, and $M$ is a prediction suffix tree as in (10). This extension allows CTW to be used as a mixture environment model (Definition 4) in the ρUCT algorithm, where we combine (11) and (7) to predict the next percept given a history.



Krichevsky-Trofimov Estimator. We start with a brief review of the KT estimator for Bernoulli distributions. Given a binary string $y_{1:t}$ with $a$ zeroes and $b$ ones, the KT estimate of the probability of the next symbol is given by

$$\Pr\nolimits_{kt}(Y_{t+1} = 1 \mid y_{1:t}) := \frac{b + 1/2}{a + b + 1}. \tag{12}$$

The KT estimator can be obtained via a Bayesian analysis by putting an uninformative (Jeffreys Beta(1/2, 1/2)) prior $\Pr(\theta) \propto \theta^{-1/2} (1 - \theta)^{-1/2}$ on the parameter $\theta \in [0, 1]$ of the Bernoulli distribution. The probability of a string $y_{1:t}$ is given by

$$\Pr\nolimits_{kt}(y_{1:t}) = \Pr\nolimits_{kt}(y_1 \mid \epsilon) \Pr\nolimits_{kt}(y_2 \mid y_1) \cdots \Pr\nolimits_{kt}(y_t \mid y_{<t}) = \int \theta^b (1 - \theta)^a \Pr(\theta) \, d\theta.$$
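
The KT estimator is simple enough to state directly in code. The following Python sketch computes the one-step estimate of Equation (12) and the probability of a whole binary string as a product of one-step estimates; the function names are chosen for this example only.

```python
def kt_predict_one(zeros, ones):
    """KT estimate that the next bit is 1, given counts of zeros and ones so far."""
    return (ones + 0.5) / (zeros + ones + 1)


def kt_sequence_probability(bits):
    """Probability the KT estimator assigns to a whole sequence of 0/1 integers."""
    zeros = ones = 0
    probability = 1.0
    for bit in bits:
        p_one = kt_predict_one(zeros, ones)
        probability *= p_one if bit == 1 else 1.0 - p_one
        if bit == 1:
            ones += 1
        else:
            zeros += 1
    return probability
```

For instance, the string 11 receives probability $\frac{1}{2} \cdot \frac{3}{4} = 0.375$, and the probabilities of all four strings of length two sum to one.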



Prediction Suffix Trees. We next describe prediction suffix trees. We consider a binary tree where all the left edges are labelled 1 and all the right edges are labelled 0. The depth of a binary tree $M$ is denoted by $d(M)$. Each node in $M$ can be identified by a string in $\{0, 1\}^*$ as usual: $\epsilon$ represents the root node of $M$; and if $n \in \{0, 1\}^*$ is a node in $M$, then $n1$ and $n0$ represent respectively the left and right children of node $n$. The set of $M$'s leaf nodes is denoted by $L(M) \subset \{0, 1\}^*$. Given a binary string $y_{1:t}$ where $t \ge d(M)$, we define $M(y_{1:t}) := y_t y_{t-1} \ldots y_{t'}$, where $t' \le t$ is the (unique) positive integer such that $y_t y_{t-1} \ldots y_{t'} \in L(M)$.

Definition 5 A prediction suffix tree (PST) is a pair $(M, \Theta)$, where $M$ is a binary tree and associated with each $l \in L(M)$ is a distribution over $\{0, 1\}$ parameterised by $\theta_l \in \Theta$. We call $M$ the model of the PST and $\Theta$ the parameter of the PST.

A PST $(M, \Theta)$ maps each binary string $y_{1:t}$, $t \ge d(M)$, to $\theta_{M(y_{1:t})}$; the intended meaning is that $\theta_{M(y_{1:t})}$ is the probability that the next bit following $y_{1:t}$ is 1. For example, the PST in Figure 1 maps the string 1110 to $\theta_{M(1110)} = \theta_{01} = 0.3$, which means the next bit after 1110 is 1 with probability 0.3.



Figure 1: An example prediction suffix tree (leaf parameters $\theta_1 = 0.1$, $\theta_{01} = 0.3$, $\theta_{00} = 0.5$)
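
The mapping from a binary string to a leaf of the PST can be sketched in a few lines of Python. The example encodes the tree of Figure 1 as a dictionary from leaf labels to parameters; this dictionary encoding and the function names are assumptions made for illustration. A leaf label is read off by walking backwards from the most recent bit until a leaf is reached.

```python
def pst_context(leaves, bits):
    """Return the leaf label M(y) for a bit string, reading from the newest bit."""
    context = ""
    for bit in reversed(bits):
        context += bit
        if context in leaves:
            return context
    raise ValueError("bit string is shorter than the depth of the tree")


def pst_predict_one(pst, bits):
    """Probability that the bit following `bits` is 1 under the given PST."""
    return pst[pst_context(pst, bits)]


# The PST of Figure 1: leaf label -> probability that the next bit is 1.
FIGURE_1_PST = {"1": 0.1, "01": 0.3, "00": 0.5}
```

With this encoding, the string 1110 ends in the bits 0 then 1 when read backwards, so it is mapped to the leaf 01 and the prediction 0.3, matching the example in the text.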



Action-Conditional PST. In the agent setting, we reduce 
the problem of predicting history sequences with general 
non-binary alphabets to that of predicting the bit represen- 
tations of those sequences. Further, we only ever condition 
on actions; this is achieved by appending bit representations 
of actions to the input sequence without updating the PST 
parameters. 

Assume $|\mathcal{X}| = 2^{l_\mathcal{X}}$ for some $l_\mathcal{X} > 0$. Denote by $[\![x]\!] = x[1, l_\mathcal{X}] = x[1] x[2] \ldots x[l_\mathcal{X}]$ the bit representation of $x \in \mathcal{X}$. Denote by $[\![x_{1:t}]\!] = [\![x_1]\!] [\![x_2]\!] \ldots [\![x_t]\!]$ the bit representation of a sequence $x_{1:t}$. Action symbols are treated similarly.

To do action-conditional sequence prediction using a PST with a given model $M$ but unknown parameter, we start with $\theta_l := \Pr_{kt}(1 \mid \epsilon) = 1/2$ at each $l \in L(M)$. We set aside an initial portion of the binary history sequence to initialise the variable $h$ and then repeat the following steps as long as needed:

1. set $h := h [\![a]\!]$, where $a$ is the current selected action;

2. for $i := 1$ to $l_\mathcal{X}$ do

(a) predict the next bit using the distribution $\theta_{M(h)}$;

(b) observe the next bit $x[i]$, update $\theta_{M(h)}$ using (12) according to the value of $x[i]$, and then set $h := h x[i]$.
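
The two steps above can be sketched in Python for a single PST with a fixed model. The class below keeps a pair of KT counts at each leaf; action bits only extend the history, while percept bits first update the counts of the leaf selected by the current history and are then appended. The class name, its methods, and the use of bit strings are illustrative assumptions rather than the paper's code.

```python
class ActionConditionalPST:
    """A PST with a fixed set of leaves and KT-estimated leaf parameters."""

    def __init__(self, leaves, initial_bits):
        self.counts = {leaf: [0, 0] for leaf in leaves}  # [zeros, ones] per leaf
        self.h = initial_bits  # initial portion of the history, long enough to reach a leaf

    def _leaf(self):
        context = ""
        for bit in reversed(self.h):
            context += bit
            if context in self.counts:
                return context
        raise ValueError("history is too short to reach a leaf")

    def predict_one(self):
        """KT estimate, at the current leaf, that the next percept bit is 1."""
        zeros, ones = self.counts[self._leaf()]
        return (ones + 0.5) / (zeros + ones + 1)

    def act(self, action_bits):
        # Step 1: condition on the action without touching any parameters.
        self.h += action_bits

    def observe(self, percept_bits):
        # Step 2: update the selected leaf for each percept bit, then extend h.
        for bit in percept_bits:
            self.counts[self._leaf()][int(bit)] += 1
            self.h += bit
```

A short usage example: starting from the initial history 00 with the leaves of Figure 1, taking action 1 selects leaf 1, where the untrained estimate is 1/2; after observing the percept bit 1 and taking action 1 again, the estimate at that leaf becomes 0.75.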

Let $M$ be the model of a prediction suffix tree, $a_{1:t}$ an action sequence, $x_{1:t}$ a percept sequence, and $h := [\![ax_{1:t}]\!]$. For each node $n$ in $M$, define $h_{M,n}$ by

$$h_{M,n} := h_{i_1} h_{i_2} \cdots h_{i_k} \tag{13}$$

where $1 \le i_1 < i_2 < \cdots < i_k \le t$ and, for each $i$, $i \in \{i_1, i_2, \ldots, i_k\}$ iff $h_i$ is a percept bit and $n$ is a prefix of $M(h_{1:i-1})$. We have the following expression for the probability of $x_{1:t}$ given $M$ and $a_{1:t}$:

$$\Pr(x_{1:t} \mid M, a_{1:t}) = \prod_{i=1}^{t} \prod_{j=1}^{l_\mathcal{X}} \Pr(x_i[j] \mid M, [\![ax_{<i} a_i]\!] \, x_i[1, j-1]) = \prod_{n \in L(M)} \Pr\nolimits_{kt}(h_{M,n}). \tag{14}$$



Context Tree Weighting. The above deals with action- 
conditional prediction using a single PST. We now show 
how we can efficiently perform action-conditional predic- 
tion using a Bayesian mixture of PSTs. There are two main 
computational tricks: the use of a data structure to represent 
all PSTs of a certain maximum depth and the use of proba- 
bilities of sequences in place of conditional probabilities. 

Definition 6 A context tree of depth D is a perfect binary 
tree of depth D such that attached to each node (both inter- 
nal and leaf) is a probability on {0, 1}*. 

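The weighted probability $P_w^n$ of each node $n$ in the context tree $T$ after seeing $h := [\![ax_{1:t}]\!]$ is defined as follows:

$$P_w^n := \begin{cases} \Pr\nolimits_{kt}(h_{T,n}) & \text{if } n \text{ is a leaf node;} \\ \frac{1}{2} \Pr\nolimits_{kt}(h_{T,n}) + \frac{1}{2} P_w^{n0} \times P_w^{n1} & \text{otherwise.} \end{cases}$$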



The following is a straightforward extension of a result due to [WST95].

Lemma 1 Let $T$ be the depth-$D$ context tree after seeing $h := [\![ax_{1:t}]\!]$. For each node $n$ in $T$ at depth $d$, we have

$$P_w^n = \sum_{M \in C_{D-d}} 2^{-\Gamma_{D-d}(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(h_{T,nl}), \tag{15}$$

where $C_d$ is the set of all models of PSTs with depth $\le d$, and $\Gamma_d(M)$ is the code-length for $M$ given by the number of nodes in $M$ minus the number of leaf nodes in $M$ of depth $d$.

A corollary of Lemma 1 is that at the root node $\epsilon$ of the context tree $T$ after seeing $h := [\![ax_{1:t}]\!]$, we have

$$P_w^\epsilon(x_{1:t} \mid a_{1:t}) = \sum_{M \in C_D} 2^{-\Gamma_D(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(h_{T,l}) \tag{16}$$

$$= \sum_{M \in C_D} 2^{-\Gamma_D(M)} \prod_{l \in L(M)} \Pr\nolimits_{kt}(h_{M,l}) \tag{17}$$

$$= \sum_{M \in C_D} 2^{-\Gamma_D(M)} \Pr(x_{1:t} \mid M, a_{1:t}), \tag{18}$$

where the last step follows from (14). Notice that the prior $2^{-\Gamma_D(\cdot)}$ penalises PSTs with large tree structures. The conditional probability of $x_t$ given $ax_{<t} a_t$ can be obtained from (7). We can also efficiently sample the individual bits of $x_t$ one by one.
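
The weighted-probability recursion and the action-conditional update can be combined into a compact Python sketch of a context tree. Several choices here are assumptions made to keep the example short and are not taken from the paper: the initial context is padded with zeros, probabilities are stored as plain floats rather than in the log domain (so it is only suitable for short sequences), and conditional prediction is done by copying the tree, applying a trial update, and taking a ratio of root probabilities.

```python
import copy


class Node:
    """One context-tree node: KT counts plus KT and weighted block probabilities."""

    def __init__(self):
        self.counts = [0, 0]  # zeros and ones seen in this context
        self.p_kt = 1.0       # KT probability of the percept bits routed here
        self.p_w = 1.0        # weighted probability of this node
        self.children = {}    # context bit -> child Node, created on demand


class ActionConditionalCTW:
    def __init__(self, depth):
        self.depth = depth
        self.root = Node()
        self.history = [0] * depth  # assumption: zero padding as the initial context

    def _path(self):
        """Nodes from the root down to depth D along the most recent history bits."""
        node, path = self.root, [self.root]
        start = len(self.history) - self.depth
        for bit in reversed(self.history[start:]):
            node = node.children.setdefault(bit, Node())
            path.append(node)
        return path

    def update_percept_bit(self, bit):
        path = self._path()
        for d in range(self.depth, -1, -1):  # deepest node first, root last
            node = path[d]
            zeros, ones = node.counts
            node.p_kt *= (node.counts[bit] + 0.5) / (zeros + ones + 1)
            node.counts[bit] += 1
            if d == self.depth:
                node.p_w = node.p_kt
            else:
                product = 1.0  # a child that was never created has weighted probability 1
                for child in node.children.values():
                    product *= child.p_w
                node.p_w = 0.5 * node.p_kt + 0.5 * product
        self.history.append(bit)

    def update_action_bit(self, bit):
        # Action bits extend the context but leave all probabilities unchanged.
        self.history.append(bit)

    def predict(self, bit):
        """Conditional probability of the next percept bit given the history."""
        trial = copy.deepcopy(self)
        trial.update_percept_bit(bit)
        return trial.root.p_w / self.root.p_w
```

Each percept-bit update touches only the $D + 1$ nodes on one root-to-leaf path, which is where the $O(D)$ per-bit cost comes from; a practical implementation would revert the trial update in place instead of copying the tree.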

Computational Complexity. The Action-Conditional CTW algorithm grows the context tree dynamically. Using a context tree with depth $D$, there are at most $O(tD \log(|\mathcal{O}||\mathcal{R}|))$ nodes in the context tree after $t$ cycles. In practice, this is a lot less than $2^D$, the number of nodes in a fully grown context tree. The time complexity of Action-Conditional CTW is also impressive, requiring $O(D \log(|\mathcal{O}||\mathcal{R}|))$ time to process each new piece of agent experience and $O(mD \log(|\mathcal{O}||\mathcal{R}|))$ to sample a single trajectory when combined with ρUCT. Importantly, this is independent of $t$, which means that the computational overhead does not increase as the agent gathers more experience.



6 Theoretical Results

Putting the ρUCT and Action-Conditional CTW algorithms together yields our approximate AIXI agent. We now investigate some of its properties.

Model Class Approximation. By instantiating (5) with the mixture environment model (18), one can show that the optimal action for an agent at time $t$, having experienced $ax_{<t}$, is given by

$$\arg\max_{a_t} \sum_{x_t} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \sum_{M \in C_D} 2^{-\Gamma_D(M)} \Pr(x_{1:t+m} \mid M, a_{1:t+m}).$$

Compare this to the action chosen by the AIXI agent

$$\arg\max_{a_t} \sum_{x_t} \cdots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \sum_{\rho \in \mathcal{M}} 2^{-K(\rho)} \rho(x_{1:t+m} \mid a_{1:t+m}),$$

where class $\mathcal{M}$ consists of all computable environments $\rho$ and $K(\rho)$ denotes the Kolmogorov complexity of $\rho$. Both use a prior that favours simplicity. The main difference is in the subexpression describing the mixture over the model class. AIXI uses a mixture over all enumerable chronological semimeasures, which is completely general but incomputable. Our approximation uses a mixture of all prediction suffix trees of a certain maximum depth, which is still a rather general class, but one that is efficiently computable.

Consistency of ρUCT. [KS06] shows that the UCT algorithm is consistent in finite horizon MDPs and derives finite sample bounds on the estimation error due to sampling. By interpreting histories as Markov states, the general reinforcement learning problem reduces to a finite horizon MDP and the results of [KS06] are now directly applicable. Restating the main consistency result in our notation, we have

$$\forall \epsilon \, \forall h \quad \lim_{T(h) \to \infty} \Pr\left( |V_\rho^m(h) - \hat{V}_\rho^m(h)| \le \epsilon \right) = 1. \tag{19}$$

Furthermore, the probability that a suboptimal action (with respect to $V_\rho^m(\cdot)$) is chosen by ρUCT goes to zero in the limit.

Convergence to True Environment. The next result, adapted from [Hut05], shows that if there is a good model of the (unknown) environment in $C_D$, then Action-Conditional CTW will predict well.

Theorem 1 Let $\mu$ be the true environment, and $\Upsilon \equiv P_w^\epsilon$ the mixture environment model formed from (18). The $\mu$-expected squared difference of $\mu$ and $\Upsilon$ is bounded as follows. For all $n \in \mathbb{N}$, for all $a_{1:n}$,

$$\sum_{k=1}^{n} \sum_{x_{1:k}} \mu(x_{<k} \mid a_{<k}) \left( \mu(x_k \mid ax_{<k} a_k) - \Upsilon(x_k \mid ax_{<k} a_k) \right)^2 \le \min_{M \in C_D} \left\{ \Gamma_D(M) \ln 2 + KL(\mu(\cdot \mid a_{1:n}) \,\|\, \Pr(\cdot \mid M, a_{1:n})) \right\}, \tag{20}$$

where $KL(\cdot \,\|\, \cdot)$ is the KL divergence of two distributions.

If the RHS of (20) is finite over all $n$, then the sum on the LHS can only be finite if $\Upsilon$ converges sufficiently fast to $\mu$. If KL grows sublinearly in $n$, then $\Upsilon$ still converges to $\mu$ (in a weaker Cesàro sense), which is for instance the case for all $k$-order Markov and all stationary processes $\mu$.

Overall Result. Theorem 1 above in conjunction with [Hut05, Thm. 5.36] implies that $V_\Upsilon^m(h)$ converges to $V_\mu^m(h)$ as long as there exists a model in the model class that approximates the unknown environment $\mu$ well. This, and the consistency (19) of the ρUCT algorithm, imply that $\hat{V}_\Upsilon^m(h)$ converges to $V_\mu^m(h)$. More detail can be found in [VNHS09].



7 Experimental Results 

This section evaluates our approximate AIXI agent on a variety of test domains. The Cheese Maze, 4x4 Grid and Extended Tiger domains are taken from the POMDP literature. The TicTacToe domain comprises a repeated series of games against an opponent who moves randomly. The Biased RockPaperScissor domain, described in [FMWR07], involves the agent repeatedly playing RockPaperScissor against an exploitable opponent. Two more challenging domains are included: Kuhn Poker [HSHB05], where the agent plays second against a Nash optimal player, and a partially observable version of Pacman described in [VNHS09]. With the exception of Pacman, each domain has a known optimal solution. Although our domains are modest, requiring the agent to learn the environment from scratch significantly increases the difficulty of each of these problems.

Domain        |A|    |O|      A bits / O bits / R bits    D     m
Cheese Maze   4      16       …                           96    …
Tiger         …      …        …                           96    …
4x4 Grid      …      …        …                           96    12
TicTacToe     9      …        …                           …     …
Biased RPS    3      3        2/2/2                       32    4
Kuhn Poker    2      6        1/4/3                       42    2
Pacman        4      65536    2/16/8                      64    8

Table 1: Parameter Configuration

Table 1 outlines the parameters used in each experiment. The sizes of the action and observation spaces are given, along with the number of bits used to encode each space. The context depth parameter $D$ specifies the maximal number of recent bits used by the Action-Conditional CTW prediction scheme. The search horizon is given by the parameter $m$. Larger $D$ and $m$ increase the capabilities of our agent, at the expense of linearly increasing computation time; our values represent an appropriate compromise between these two competing dimensions for each problem domain.

[Figure 2: Learning scalability results]

Figure 2 shows how the performance of the agent scales with experience, measured in terms of number of interaction cycles. Experience was gathered by a decaying ε-greedy policy, which chose randomly or used ρUCT. The results are normalised with respect to the optimal average reward per time step, except in Pacman, where we normalised to an estimate. Each data point was obtained by starting the agent with an amount of experience given by the x-axis and running it greedily for 2000 cycles. The amount of search used for each problem domain, measured by the number of ρUCT simulations per cycle, is given in Table 2. (The average search time per cycle is also given.) The agent converges to optimality on all the test domains with known optimal values, and exhibits good scaling properties on our challenging Pacman variant. Visual inspection¹ of Pacman shows that the agent, whilst not playing perfectly, has already learnt a number of important concepts.

Table 2 summarises the resources required for approximately optimal performance on our test domains. Timing statistics were collected on an Intel dual 2.53 GHz Xeon.
Domains that included a planning component such as Tiger 
required more search. Convergence was somewhat slower 
in TicTacToe; the main difficulty for the agent was learn- 
ing not to lose the game immediately by playing an illegal 
move. Most impressive was that the agent learnt to play an 
approximate best response strategy for Kuhn Poker, without 
knowing the rules of the game or the opponent's strategy. 



Domain        Experience     Simulations    Search Time
Cheese Maze   5 × 10^…       500            0.9s
Tiger         5 × 10^…       10000          10.8s
4x4 Grid      2.5 × 10^…     1000           0.7s
TicTacToe     5 × 10^…       5000           8.4s
Biased RPS    1 × 10^…       10000          4.8s
Kuhn Poker    5 × 10^…       3000           1.5s

Table 2: Resources required for optimal performance

8 Related Work 



¹ http://www.youtube.com/watch?v=RhQTWidQQ8U



The BLHT algorithm [SH99] is closely related to our work.
It uses symbol level PSTs for learning and an (unspecified) 
dynamic programming based algorithm for control. BLHT 
uses the most probable model for prediction, whereas we 
use a mixture model, which admits a much stronger con- 
vergence result. A further distinction is our usage of an 
Ockham prior instead of a uniform prior over PST models. 

The Active-LZ [FMWR07] algorithm combines a Lempel-Ziv based prediction scheme with dynamic programming for control to produce an agent that is provably asymptotically optimal if the environment is n-Markov. We implemented the Active-LZ test domain, Biased RPS, and compared against their published results. Our agent was able to achieve optimal levels of performance within 10^… cycles; in contrast, Active-LZ was still suboptimal after 10^… cycles.

U-Tree [McC96] is an online agent algorithm that attempts to discover a compact state representation from a
raw stream of experience. Each state is represented as the 
leaf of a suffix tree that maps history sequences to states. As 
more experience is gathered, the state representation is re- 
fined according to a heuristic built around the Kolmogorov- 
Smirnov test. This heuristic tries to limit the growth of the 
suffix tree to places that would allow for better prediction 
of future reward. Value Iteration is used at each time step 
to update the value function for the learned state represen- 
tation, which is then used by the agent for action selection. 

It is instructive to compare and contrast our AIXI approximation with the Active-LZ and U-Tree algorithms. The small state space induced by U-Tree has the benefit of limiting the number of parameters that need to be estimated from data. This has the potential to dramatically speed up the model-learning process. In contrast, both Active-LZ and our approach require a number of parameters proportional to the number of distinct contexts. This is one of the reasons why Active-LZ exhibits slow convergence in practice. This problem is much less pronounced in our approach for two reasons. First, the Ockham prior in CTW ensures that future predictions are dominated by PST structures that have seen enough data to be trustworthy. Secondly, value function estimation is decoupled from the process of context estimation. Thus it is reasonable to expect ρUCT to make good local decisions provided Action-Conditional CTW can predict well. The downside, however, is that our approach requires search for action selection. Although ρUCT is an anytime algorithm, in practice more computation is required per cycle compared to approaches like Active-LZ and U-Tree that act greedily with respect to an estimated global value function.

The U-Tree algorithm is well motivated, but unlike 
Active-LZ and our approach, it lacks theoretical performance 
guarantees. It is possible for U-Tree to prematurely 
converge to a locally optimal state representation 
from which the heuristic splitting criterion can never recover. 
Furthermore, the splitting heuristic contains a number 
of configuration options that can dramatically influence 
its performance [McC96]. This parameter sensitivity somewhat 
limits the algorithm's applicability to the general reinforcement 
learning problem. 

Our work is also related to Bayesian Reinforcement 
Learning. In model-based Bayesian RL [PV08, Str00], a 
distribution over (PO)MDP parameters is maintained. In 
contrast, we maintain an exact Bayesian mixture of PSTs. 
The pUCT algorithm shares similarities with Bayesian 
Sparse Sampling [WLBS05]; the key differences are estimating 
the leaf node values with a rollout function and 
guiding the search with the UCB policy. 
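
For readers unfamiliar with the UCB policy mentioned above, a minimal sketch of the standard UCB1 selection rule [Aue02] is given below. This is generic UCB1, not the exact rule used inside pUCT: the function name, the dictionary layout, the exploration constant, and the assumption that mean values are already scaled to [0, 1] are all illustrative choices.

```python
from math import log, sqrt


def ucb_select(stats, exploration=sqrt(2)):
    """Pick an action by UCB1.

    stats maps each action to (visit_count, mean_value). Mean values are
    assumed to lie in [0, 1] (an assumption of this sketch).
    """
    # Try every action at least once before using the confidence bound.
    for action, (visits, _) in stats.items():
        if visits == 0:
            return action

    total = sum(visits for visits, _ in stats.values())

    def score(action):
        visits, mean = stats[action]
        # Empirical mean plus an exploration bonus that shrinks with visits.
        return mean + exploration * sqrt(log(total) / visits)

    return max(stats, key=score)
```

The bonus term favours rarely tried actions, so a search guided this way spends most of its simulations on promising actions while still occasionally revisiting the others.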

A more comprehensive discussion of related work can be 
found in [VNHS09]. 



9 Limitations 

The main limitation of our current AIXI approximation is 
the restricted model class. Our agent will perform poorly if 
the underlying environment cannot be predicted well by a 
PST of bounded depth. Prohibitive amounts of experience 
will be required if a large PST model is needed for accurate 
prediction. For example, it would be unrealistic to think that 
our current AIXI approximation could cope with real-world 
image or audio data. 
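
To give a rough sense of scale, the sketch below counts the contexts available to a depth-bounded model. It assumes a binary alphabet and a complete context tree, which is a worst case; a PST that fits a given environment may be far sparser. The function names are hypothetical.

```python
def num_contexts(depth, alphabet_size=2):
    """Number of distinct contexts of exactly `depth` symbols."""
    return alphabet_size ** depth


def num_tree_nodes(depth, alphabet_size=2):
    """Nodes in a complete context tree of the given depth (root included)."""
    return sum(alphabet_size ** d for d in range(depth + 1))
```

For instance, num_contexts(32) is 4294967296, so a model that genuinely needs most contexts at that depth cannot gather enough statistics for each one from a realistic amount of experience.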

The identification of efficient and general model classes 
that better approximate the AIXI ideal is an important area 
for future work. Some preliminary ideas are explored in 
[VNHS09]. 



10 Conclusion 

We have introduced the first computationally tractable ap- 
proximation to the AIXI agent and shown that it provides 
a promising approach to the general reinforcement learn- 
ing problem. Investigating multi-alphabet CTW for pre- 
diction, parallelisation of pUCT, further expansion of the 
model class (ideally, beyond variable-order Markov mod- 
els) or more sophisticated rollout policies for pUCT are ex- 
citing areas for future investigation. 



11 Acknowledgements 

This work received support from the Australian Research 
Council under grant DP0988049. NICTA is funded by the 
Australian Government as represented by the Department of 
Broadband, Communications and the Digital Economy and 
the Australian Research Council through the ICT Centre of 
Excellence program. 

References 

[Aue02] Peter Auer. Using confidence bounds for 
exploitation-exploration trade-offs. JMLR, 3:397-422, 
2002. 

[FMWR07] V. Farias, C. Moallemi, T. Weissman, and 
B. Van Roy. Universal Reinforcement Learning. CoRR, 
abs/0707.3087, 2007. 

[HSHB05] Bret Hoehn, Finnegan Southey, Robert C. 
Holte, and Valeriy Bulitko. Effective short-term opponent 
exploitation in simplified poker. In AAAI'05, pages 
783-788, 2005. 

[Hut05] Marcus Hutter. Universal Artificial Intelligence: 
Sequential Decisions Based on Algorithmic Probability. 
Springer, 2005. 

[KS06] Levente Kocsis and Csaba Szepesvári. Bandit 
based Monte-Carlo planning. In ECML, pages 282-293, 
2006. 

[McC96] Andrew Kachites McCallum. Reinforcement 
Learning with Selective Perception and Hidden State. 
PhD thesis, University of Rochester, 1996. 

[PV08] Pascal Poupart and Nikos Vlassis. Model-based 
Bayesian Reinforcement Learning in Partially Observable 
Domains. In ISAIM, 2008. 

[RST96] D. Ron, Y. Singer, and N. Tishby. The power 
of amnesia: Learning probabilistic automata with vari- 
able memory length. Machine Learning, 25(2):117-150, 
1996. 

[SH99] Nobuo Suematsu and Akira Hayashi. A reinforce- 
ment learning algorithm in partially observable environ- 
ments using short-term memory. In NIPS, pages 1059- 
1065, 1999. 

[Str00] M. Strens. A Bayesian framework for reinforcement 
learning. In ICML, pages 943-950, 2000. 

[VNHS09] Joel Veness, Kee Siong Ng, Marcus Hutter, and 
David Silver. A Monte Carlo AIXI Approximation. 
CoRR, abs/0909.0801, 2009. 

[WLBS05] T. Wang, D.J. Lizotte, M.H. Bowling, and 
D. Schuurmans. Bayesian sparse sampling for on-line 
reward optimization. In ICML, pages 956-963, 2005. 

[WST95] Frans M.J. Willems, Yuri M. Shtarkov, and 
Tjalling J. Tjalkens. The Context Tree Weighting 
Method: Basic Properties. IEEE Transactions on Infor- 
mation Theory, 41:653-664, 1995. 



